Author : Indumathi Pandiyan

Ensemble Learning Project - 3rd Project submitted for PGP-AIML Great Learning on 19-Dec-2021

PART A

DOMAIN: Telecom

CONTEXT: A telecom company wants to use their historical customer data to predict behaviour to retain customers. You can analyse all relevant customer data and develop focused customer retention programs.

• DATA DESCRIPTION: Each row represents a customer, each column contains customer’s attributes described on the column Metadata.

The data set includes information about:

• Customers who left within the last month – the column is called Churn
• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
• Demographic info about customers – gender, age range, and if they have partners and dependents

• PROJECT OBJECTIVE: Build a model that helps identify customers who have a high probability of churning. This helps the company understand the pain points and patterns behind customer churn and focus its customer-retention strategies.

Import All the Libraries

Revision History: 12-12-2021 - Label Encoding enhanced

STEPS AND TASK [30 Marks]:

1. Data Understanding & Exploration: [5 Marks]

A. Read ‘TelcomCustomer-Churn_1.csv’ as a DataFrame and assign it to a variable. [1 Mark]

Observation 1: There are 7043 Observations / Rows and 10 Attributes / Columns in churn1 dataset

B. Read ‘TelcomCustomer-Churn_2.csv’ as a DataFrame and assign it to a variable. [1 Mark]

Observation 2: There are 7043 Observations / Rows and 12 Attributes / Columns in churn2 dataset

C. Merge both the DataFrames on key ‘customerID’ to form a single DataFrame. [2 Marks]
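Steps A–C can be sketched as below; the helper name is illustrative, the file names come from the task statement, and the variable names churn1, churn2, and cdata match those used in the observations.

```python
import pandas as pd

def load_and_merge(path1: str, path2: str, key: str = "customerID") -> pd.DataFrame:
    """Read the two CSV halves and merge them on the common key."""
    churn1 = pd.read_csv(path1)
    churn2 = pd.read_csv(path2)
    # An inner merge on customerID keeps one row per customer
    return churn1.merge(churn2, on=key)

# Usage with the file names from the task statement:
# cdata = load_and_merge("TelcomCustomer-Churn_1.csv", "TelcomCustomer-Churn_2.csv")
```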

Observation 3 : There are totally 21 variables which includes 20 independent variable and one target variable Churn

Observation 4: Only 3 of the columns (SeniorCitizen, tenure, MonthlyCharges) are numerical, as seen in describe().

Observation: Variables such as gender, Partner, and Dependents are of object datatype; for model building they are converted to the categorical datatype.

Comments: All object-datatype columns have been converted to categorical datatypes.

D. Verify that all the columns are incorporated in the merged DataFrame by using a simple comparison operator in Python. [1 Mark]

Comments

isin() checks whether all the columns of churn1 are available in cdata; if any column name is missing, it returns False. The code below checks whether the churn1 and churn2 columns are present in the newly built DataFrame.

Observation 5: It has been verified that all the columns of churn1 and churn2 are available in the new dataset cdata.
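The isin() check described above might look like the following sketch; the helper name is illustrative.

```python
import pandas as pd

def all_columns_present(parts: list, merged: pd.DataFrame) -> bool:
    """True if every column of every part DataFrame appears in the merged one."""
    # Index.isin gives one boolean per column name; all() must hold for each part
    return all(part.columns.isin(merged.columns).all() for part in parts)

# Usage with the frames from the previous steps:
# all_columns_present([churn1, churn2], cdata)
```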

2. Data Cleaning & Analysis: [15 Marks]

EDA Descriptive Statistics

Mean

Observation 6: There are no duplicate rows in the dataset.

A. Impute missing/unexpected values in the DataFrame. [2 Marks]

Comments: No missing values were found.

Observation:

  1. customerID is the identifier, hence a string object.
  2. gender has the values Male and Female. There are no unexpected values.
  3. Partner specifies whether the customer has a partner or not; it has only Yes and No values.
  4. Dependents also has Yes or No to specify whether the customer has dependents or not.
  5. PhoneService also has Yes or No to specify whether the customer availed phone service or not.
  6. MultipleLines has 3 values: Yes, No, and No phone service. A customer can have multiple lines only with phone service, so there are no unexpected values.
  7. InternetService has 3 values: DSL, Fiber optic, and No. If a customer has internet service, it is either DSL or Fiber optic.
  8. OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies each have 3 values: Yes, No, and No internet service. These services are available only to customers with internet service, so there are no unexpected values.
  9. Contract has 3 values: Month-to-month, One year, and Two year. No unexpected values.
  10. PaperlessBilling has either Yes or No values.
  11. PaymentMethod has four values: Electronic check, Mailed check, Bank transfer (automatic), and Credit card (automatic).
  12. TotalCharges represents the total charges; it is a continuous variable and should be converted to a numerical type.

B. Make sure all the variables with continuous values are of ‘Float’ type. [2 Marks]

Making a copy of the original dataset before making any changes

Converting TotalCharges to the float datatype, as it was observed as a categorical/object type in the previous analysis

Comments: The datatype of TotalCharges has been converted to float.

Comments: 11 null values were observed in TotalCharges.

Comments: It is observed that TotalCharges is the product of tenure and MonthlyCharges.

Comments: As TotalCharges is tenure × MonthlyCharges, and tenure is 0 for all the rows with null values, the assumption is that these customers have not yet completed a full month. Hence TotalCharges is imputed with MonthlyCharges.
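The conversion and imputation described in these comments can be sketched as follows; the helper name is illustrative, and pd.to_numeric with errors="coerce" turns the blank TotalCharges strings into NaN.

```python
import pandas as pd

def fix_total_charges(df: pd.DataFrame) -> pd.DataFrame:
    """Convert TotalCharges to float; the rows with blank TotalCharges have
    tenure 0, so impute them with MonthlyCharges (the first month's bill)."""
    out = df.copy()
    # Blank strings cannot be cast directly; coerce them to NaN first
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    mask = out["TotalCharges"].isna()
    out.loc[mask, "TotalCharges"] = out.loc[mask, "MonthlyCharges"]
    return out
```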

Observation:
SeniorCitizen, tenure, MonthlyCharges, and TotalCharges are numerical columns. SeniorCitizen is really categorical (whether the customer is a senior citizen or not).

Tenure - has values between 0 and 72. The mean (32.37) is greater than the median (29.00), so the distribution is positively skewed.
Monthly charges - has values between 18.25 and 118.75. The mean (64.76) is less than the median (70.35), so the distribution is negatively skewed.
Total charges - has values between 18.80 and 8684. The mean (2279.79) is greater than the median (1394.55), so the distribution is positively skewed.

Reference

Positive skewness: if the mean is greater than the median, the distribution is positively skewed.
Negative skewness: If the mean is less than the median, the distribution is negatively skewed.

Comments: Continuous variables like MonthlyCharges and TotalCharges are of float datatype.

C. Create a function that will accept a DataFrame as input and return pie charts for all the appropriate categorical features. Clearly show the percentage distribution in the pie charts. [4 Marks]

Function for building the pie charts; it accepts the cdata DataFrame.
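A sketch of such a function; the name and the cut-off on the number of levels are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_categorical_pies(df: pd.DataFrame, max_levels: int = 10):
    """Draw a pie chart with percentage labels for every categorical
    column that has at most max_levels distinct values."""
    cat_cols = [c for c in df.columns
                if str(df[c].dtype) in ("object", "category")
                and df[c].nunique() <= max_levels]
    for col in cat_cols:
        counts = df[col].value_counts()
        plt.figure(figsize=(4, 4))
        # autopct prints the percentage share on each wedge
        plt.pie(counts, labels=counts.index, autopct="%1.2f%%")
        plt.title(col)
        plt.show()
        plt.close()
    return cat_cols  # the columns that were plotted
```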

D. Share insights for Q2.c. [2 Marks]

Following are the categorical features in the customer dataset:
1. gender: Female and Male customers are almost equal (Male - 50.48% and Female - 49.52%).
2. Partner: Customers with and without a partner are also almost equal (Yes - 51.70% and No - 48.30%).
3. Dependents: Only about one third of the customers have dependents (No - 70% and Yes - 30%).
4. PhoneService: Almost 90% of customers have phone service.
5. MultipleLines: Of the 90% with phone service, 42% have multiple lines and 48% do not.
6. InternetService: Nearly 22% of customers have no internet service; of the remaining 78%, 44% have fiber optic and nearly 34% have DSL.

The following are internet-based services, so the observations show what percentage availed each service:
6.a OnlineSecurity: nearly 50% do not have online security and 28% do.
6.b OnlineBackup: 44% do not have online backup and 34% do.
6.c DeviceProtection: similarly, 44% do not have device protection and 34% do.
6.d TechSupport: 49% do not have tech support and 29% do.
6.e StreamingTV: 38% have streaming TV and 40% do not.
6.f StreamingMovies: 38.7% have streaming movies and 39.3% do not.

7. Contract: Most customers (55%) opted for the month-to-month contract, followed by two year (24%) and one year (21%).
8. PaperlessBilling: 60% of customers prefer paperless billing, whereas nearly 40% do not.
9. PaymentMethod: Electronic check (33.5%) and mailed check (22.8%), followed by bank transfer (21%) and credit card (21%).
10. Churn: The churn rate is 26.5%.

E. Encode all the appropriate Categorical features with the best suitable approach. [2 Marks]

Verifying the categorical values before encoding

Heatmap to understand the correlation between all features
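One possible encoding sketch, using label encoding as noted in the revision history; the helper name is illustrative. The identifier column should be dropped first, and for nominal features with many levels one-hot encoding may be preferable.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Label-encode every object/categorical column in place of its
    string values (assumes the customerID column was dropped beforehand)."""
    out = df.copy()
    for col in out.select_dtypes(include=["object", "category"]).columns:
        out[col] = LabelEncoder().fit_transform(out[col])
    return out
```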

F. Split the data into 80% train and 20% test. [1 Mark]

Split 80% Train and 20% Test data

G. Normalize/Standardize the data with the best suitable approach. [2 Marks]
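Steps F and G together might be sketched as follows; the helper name and the random seed are illustrative, and the scaler is fit on the training split only so that no test-set statistics leak into training.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_scale(X, y, test_size=0.2, seed=1):
    """80/20 split, then standardize using statistics learned on the
    training set only (to avoid leaking test information)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)  # reuse the training-set statistics
    return X_train, X_test, y_train, y_test
```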

3. Model building and Improvement: [10 Marks]

A. Train a model using XGBoost. Also print best performing parameters along with train and test performance. [5 Marks]

B. Improve performance of the XGBoost as much as possible. Also print best performing parameters along with train and test performance. [5 Marks]

Improve XGBoost with K-Fold Crossvalidation

In k-fold cross validation, every entry in the original training dataset is used for both training and validation, and each entry is used for validation exactly once. XGBoost supports k-fold cross validation via the cv() method: you specify the nfold parameter, the number of cross-validation sets to build. It also supports the following parameters:

num_boost_round: denotes the number of trees you build (analogous to n_estimators)
metrics: tells the evaluation metrics to be watched during CV
as_pandas: to return the results in a pandas DataFrame.
early_stopping_rounds: finishes training of the model early if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds.
seed: for reproducibility of results.

cv_results contains train and test RMSE metrics for each boosting round

final boosting round metric

PART B

• DOMAIN: IT

• CONTEXT: The purpose is to build a machine learning pipeline that works autonomously irrespective of the dataset, so users can save the effort involved in building pipelines for each dataset.

• PROJECT OBJECTIVE: Build a machine learning pipeline that runs autonomously on a given CSV file and returns the best performing model.

• STEPS AND TASK [30 Marks]:

  1. Build a simple ML pipeline which will accept a single ‘.csv’ file as input and return a trained base model that can be used for predictions. You can use 1 Dataset from Part 1 (single/merged).
  2. Create separate functions for various purposes.
  3. Various base models should be trained to select the best performing model.
  4. Pickle file should be saved for the best performing model.
    Include best coding practices in the code: modularization, maintainability, well-commented code, etc.

Purpose

  1. Build a simple ML pipeline
  2. Functions for various purposes
  3. Train base models and choose the best performing model
  4. A pickle file should be saved for the best performing model

Approach to Build the Pipeline to get a data file and return best Performing model

Approach Description:

To build the pipeline, the following approach is taken. The main code passes the data file to the subclasses, which are modularized to carry out the following activities:

Modularization:
1. Data Analysis
2. Data Transformation
3. Data Visualization
4. Split Test and Train
5. Model Building and returning the best model

Import the required Libraries

Class Description

For the purposes of modularization and readability, three subclasses are defined.

Preprocessor - the class responsible for all the preprocessing before model building. It uses the classes below for this functionality:

1. DataAnalyser - To Analyse the data
2. DataVisualizer- To visualize the data in graphs and charts
3. DataTransformer - To do the clean up, imputation etc and give the required X and Y set for model Building

DataPreprocessor - the main class, which uses all three of the above classes to do the necessary preprocessing.
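Under the class layout described above, a skeleton might look like this; the method bodies are reduced to placeholders (the real project's methods are richer), and the target-column parameter is an assumption.

```python
import pandas as pd

class DataAnalyser:
    """Shape, summary statistics, and null-value checks."""
    def analyse(self, df: pd.DataFrame) -> None:
        print(df.shape)
        print(df.describe(include="all"))
        print(df.isnull().sum())

class DataVisualizer:
    """Graphs and charts for the raw data (omitted in this sketch)."""
    def visualize(self, df: pd.DataFrame) -> None:
        pass

class DataTransformer:
    """Clean-up and imputation; returns feature matrix X and target y."""
    def transform(self, df: pd.DataFrame, target: str):
        clean = df.dropna()
        return clean.drop(columns=[target]), clean[target]

class DataPreprocessor:
    """Main class: delegates to the three helper classes above."""
    def __init__(self):
        self.analyser = DataAnalyser()
        self.visualizer = DataVisualizer()
        self.transformer = DataTransformer()

    def run(self, df: pd.DataFrame, target: str):
        self.analyser.analyse(df)
        self.visualizer.visualize(df)
        return self.transformer.transform(df, target)
```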

Function to load the data from csv file

Data Analyser class has the required methods for Analysing the data.

This includes

1. Checking the shape and 5-point summary
2. Verifying the null values
3. Finding categorical columns
4. A method to convert object columns to numeric
5. Finding the index column, which is expected to be dropped
6. Finding numerical columns stored with the object datatype

Class DataTransformer has the methods for making the required changes to the data.

Class DataVisualizer has the methods to visualize the data in the form of graphs and charts.

The PreProcessor class does all the necessary work for model building.

Main Module to Process the Data and Build the Model
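The model-selection step of the main module might be sketched like this; the candidate list and helper name are illustrative, and mean cross-validated accuracy is used as the selection metric.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def best_base_model(X, y):
    """Train several base models and return the one with the highest
    mean cross-validated accuracy."""
    candidates = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "DecisionTree": DecisionTreeClassifier(random_state=1),
        "RandomForest": RandomForestClassifier(random_state=1),
    }
    scores = {name: cross_val_score(m, X, y, cv=5, scoring="accuracy").mean()
              for name, m in candidates.items()}
    best_name = max(scores, key=scores.get)
    print(scores)
    # Refit the winner on the full data before returning it
    return candidates[best_name].fit(X, y)
```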

The best model identified through the pipeline is Logistic Regression.

Save Model To a File Using Python Pickle

Load Saved Model
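A minimal sketch of the save/load step; the file name is illustrative, and for scikit-learn models joblib is a common alternative to pickle.

```python
import pickle

def save_model(model, path: str) -> None:
    """Serialize the best performing model to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path: str):
    """Load a previously saved model for further prediction."""
    with open(path, "rb") as f:
        return pickle.load(f)
```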

Observation And Conclusion:

Through the pipeline created, we can pass in a DataFrame and identify the best performing model based on accuracy. Pipelines are a standard industry approach for fitting models on preprocessed data.

The best model is identified and saved to a pickle file, which can be used for further prediction. In this way we can save the model and retrieve it later; when it is reused for prediction, the same accuracy is obtained, so the time spent on training can be drastically reduced.